Abstract: In recent years, Correlation Filter-based Trackers (CFTs)
have attracted increasing interest in the field of visual object
tracking and have achieved compelling results in various
competitions and benchmarks. In this paper, our goal is to review
the development of CFTs with extensive experimental results. Eleven
trackers are surveyed, based on which a general framework is
summarized. Furthermore, we investigate different training schemes
for correlation filters and discuss various effective improvements
that have been made recently. Comprehensive experiments have been
conducted to evaluate the effectiveness and efficiency of the
surveyed CFTs, and comparisons have been made with other competing
trackers. The experimental results show that state-of-the-art
performance, in terms of robustness, speed and accuracy, can be
achieved by several recent CFTs, such as MUSTer and SAMF. We find
that correlation filter-based tracking can be further improved by
estimating scales, applying part-based tracking strategies and
cooperating with long-term tracking methods.

Index Terms: Visual object tracking, correlation filters, tracking
evaluation, computer vision

I. Introduction

Visual object tracking is one of the most challenging tasks in
computer vision and is related to a wide range of applications such
as surveillance and robotics. Given the initial state of a target in
the first frame, the goal of tracking is to predict the states of
the target in the rest of the video. However, designing a fast and
robust tracker is difficult because of various critical issues in
visual tracking, such as illumination variations, occlusions,
deformations and rotations. Over the past decade, various tracking
algorithms have been proposed to cope with these challenges, some of
which use generative models while others use discriminative models.
Generative trackers perform tracking by searching for the
best-matching windows, whereas discriminative methods learn to
distinguish the target from the background. In [13], it has been
found that background information is advantageous for effective
tracking, which suggests that discriminative methods are more
competitive. In particular, correlation filter-based discriminative
trackers have made significant progress recently and have received
growing attention from researchers. Summarizing the development of
correlation filter-based tracking algorithms and comparing them with
other popular trackers should therefore benefit future research.

Conventionally, correlation filters are designed to produce
correlation peaks for each target of interest in the scene while
yielding low responses to the background, and they are usually used
as detectors of expected patterns. Although localization tasks can
be performed effectively by these filters, their training
requirements used to make them unsuitable for online tracking. This
situation changed only after the Minimum Output Sum of Squared Error
(MOSSE) filter [15] was proposed. Using an adaptive training scheme,
MOSSE is considerably robust and efficient in tracking. Numerous
improvements have since been built on the basic framework of the
MOSSE filter. For example, Henriques et al. [16] improved the MOSSE
filter by introducing kernel methods, and Danelljan et al. [17]
applied color attributes to better represent the input data. By
further handling scale changes, three Correlation Filter-based
Trackers (CFTs), namely SAMF [18], DSST [19] and an improved KCF
[20], have achieved state-of-the-art results and have beaten all
other participating trackers in terms of accuracy in a recent
competition. With more CFTs developed recently, correlation
filter-based tracking has proven its great strengths in efficiency
and robustness, and has considerably accelerated the development of
visual object tracking.

Despite the various correlation filter-based tracking algorithms
proposed in recent years, no work has yet reviewed them with
comprehensive evaluations. To support future research, the goals of
this paper are: 1) formulating a general framework; 2) investigating
the major developments of CFTs; 3) carrying out comprehensive
evaluations on a large-scale benchmark; 4) making appropriate
comparisons; and 5) illustrating future research directions.

In this work, various important CFTs are surveyed and their
contributions are discussed in detail. Brief introductions of these
studies can be found in Table I. In general, the training schemes of
the filters are crucial in correlation filter-based tracking, and
CFTs can be further improved by introducing better training schemes,
extracting powerful features, relieving the scaling issue, applying
part-based tracking strategies and cooperating with long-term
tracking. To evaluate the tracking performance, we have collected
the source code of 8 CFTs from the internet and implemented two CFTs
in a simple version. By running on a large-scale benchmark, the
performance of the tested CFTs is compared with that of other
popular competing trackers. The experimental results are presented
and analyzed, and they support the efficiency and robustness of
correlation filter-based tracking methods. According to the
conducted experiments, the latest CFTs are state-of-the-art
trackers.

The rest of this paper is arranged as follows. In Section II we
provide an overview of the basic framework of correlation
filter-based tracking methods. Afterwards, theories and schemes for
training correlation filters are introduced in Section III. In
Section IV, numerous aspects of further improvements are reviewed
and discussed in detail. Afterwards, experimental results are
presented and analyzed in Section V. In the end, conclusions and
future trends are summarized in Section VI.

TABLE I
Major Surveyed Papers

Name                             | Year | Major Contribution
MOSSE and Regularized ASEF [15]  | 2010 | Pioneering work of introducing correlation filters for visual tracking
CSK [16]                         | 2012 | Introduced the Ridge Regression problem with circulant matrices to apply kernel methods
STC [26]                         | 2014 | Introduced spatio-temporal context information
KCF [20]                         | 2014 | Formulated the work of CSK and introduced the multi-channel HOG feature
CN [17]                          | 2014 | Introduced color attributes as effective features
DSST [19]                        | 2014 | Relieved the scaling issue using a feature pyramid and a 3-dimensional correlation filter
SAMF [18]                        | 2014 | Integrated both color and HOG features; applied a scaling pool to handle scale variations
RPAC [24]*                       | 2015 | Introduced a part-based tracking strategy
RPT [25]                         | 2015 | Introduced reliable local patches to facilitate tracking
LCT [23]                         | 2015 | Introduced an online random fern classifier as a re-detection component for long-term tracking
MUSTer [22]                      | 2015 | Proposed a biology-inspired framework in which short-term and long-term processing cooperate with each other

* This abbreviation is taken from its title: Real-time Part-based visual tracking via Adaptive Correlation filters.

II. Correlation Filter-based Tracking Framework 

Based on the existing correlation filter-based tracking methods, the
general working framework can be summarized as follows. Initially, a
correlation filter is trained with an image patch cropped at the
given position of the target in the first frame. Then, at each
subsequent time step, the patch at the previously predicted position
is cropped for detection. Afterwards, as shown in Figure 1, various
features can be extracted from the raw input data, and a cosine
window is usually applied to smooth the boundary effects.
Subsequently, efficient correlation operations are performed by
replacing the exhaustive convolutions with element-wise
multiplications using the Discrete Fourier Transform (DFT). In
practice, the DFT of a vector is computed by the efficient Fast
Fourier Transform (FFT) algorithm. Following the correlation
procedure, a spatial confidence map, or response map, is obtained
using the inverse FFT. The position with the maximum value in this
map is then predicted as the new state of the target. Next, the
appearance at the estimated position is extracted for training and
updating the correlation filter. Because only the DFT of the
correlation filter is required for detection, the training and
updating procedures are all performed in the frequency domain.

To describe the workflow mathematically, let x be the input of the
detection stage and h be the correlation filter. In practice, x can
be either a raw image patch or extracted features. Let the hat
symbol ^ denote the Fourier transform of a vector, so that
$\hat{\mathbf{x}}$ is the DFT of x. According to the Convolution
Theorem, circular correlation becomes element-wise multiplication in
the frequency domain:

$$\mathbf{x} \otimes \mathbf{h} = \mathcal{F}^{-1}\left(\hat{\mathbf{x}} \odot \hat{\mathbf{h}}^{*}\right) \tag{1}$$

where $\mathcal{F}^{-1}$ is the inverse Fourier transform, $\odot$
denotes element-wise multiplication and $*$ denotes the complex
conjugate. The result of (1) is the expected correlation output
between x and h, which also forms the confidence map mentioned
above.
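As an illustration of (1), the following minimal NumPy sketch computes a response map with FFTs and reads off its peak. The function names (`cosine_window`, `correlate_fft`, `locate_peak`) and the random toy data are assumptions made for this example and are not taken from any surveyed tracker. The cosine window is shown because the workflow mentions it, but it is left out of the toy check so that the circular shift stays exact.

```python
import numpy as np

def cosine_window(shape):
    """2-D Hann (cosine) window that tapers a patch toward its borders."""
    return np.outer(np.hanning(shape[0]), np.hanning(shape[1]))

def correlate_fft(x, h):
    """Circular cross-correlation of x with h via the FFT, as in (1)."""
    x_hat = np.fft.fft2(x)
    h_hat = np.fft.fft2(h)
    return np.real(np.fft.ifft2(x_hat * np.conj(h_hat)))

def locate_peak(response):
    """(row, col) index of the strongest response."""
    idx = np.unravel_index(np.argmax(response), response.shape)
    return tuple(int(i) for i in idx)

# Toy check: correlating a circularly shifted patch with the original
# template peaks at the displacement between the two.
rng = np.random.default_rng(0)
template = rng.standard_normal((32, 32))
patch = np.roll(template, shift=(3, 5), axis=(0, 1))
response = correlate_fft(patch, template)
peak = locate_peak(response)  # coincides with the shift (3, 5)
```

In a real tracker the patch would typically be multiplied by `cosine_window(patch.shape)` before the FFT, and the peak would be converted into a displacement relative to the previous target position.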


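For training the filter, let us first define a desired correlation
output y. Using the new instance x' of the target, the correlation
filter h should satisfy:

$$\mathbf{y} = \mathcal{F}^{-1}\left(\hat{\mathbf{x}}' \odot \hat{\mathbf{h}}^{*}\right) \tag{2}$$

and thus:

$$\hat{\mathbf{h}}^{*} = \frac{\hat{\mathbf{y}}}{\hat{\mathbf{x}}'} \tag{3}$$

where $\hat{\mathbf{y}}$ is the DFT of y and the division is
computed element-wise.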

In terms of computational cost, the complexity of circular
convolution for an image of size $n \times n$ is $O(n^4)$, while the
element-wise multiplications using the FFT only require
$O(n^2 \log n)$. The acceleration brought by the FFT is therefore
significant.

However, several issues must be handled well when using the
correlation filter-based tracking framework. First, training schemes
are crucial for CFTs. Since the target may change its appearance
continuously, correlation filters should be trained and updated on
the fly to adapt to the new appearance of the target. Second,
feature representation also greatly influences performance. Although
raw pixels can be used directly for detection, the tracker may be
affected by various kinds of noise such as illumination changes and
motion blur. More powerful features are expected to help. Third,
adapting to the scale of the target is another challenging problem
for CFTs. Since the sizes of correlation filters are usually fixed
during tracking, scale variations of the target cannot be handled
well by these trackers. An effective scale estimation approach is
therefore needed to compensate for this shortcoming. Finally,
long-term tracking is considered a weakness of many CFTs, since they
commonly lack the ability to re-locate the target after drifting. By
cooperating with long-term tracking methods, CFTs can become much
more robust.

III. Training Schemes for Correlation Filters

The behavior of correlation filters varies with the method used for
training. Numerous approaches have been proposed to train a robust
correlation filter for online visual tracking.

Fig. 1. General workflow for typical correlation filter-based tracking methods. At each frame after initialization, an image patch at the previously estimated position
is cropped as the current input. Subsequently, visual features can be extracted to better describe the input, and a cosine window is usually applied to smooth
the discontinuities at the window boundaries. Afterwards, correlation between the current input and the learned filter is performed in the frequency domain based on
the Convolution Theorem. The symbol $\odot$ in the figure denotes element-wise computation, and FFT means Fast Fourier Transform. After the correlation, a spatial
confidence map is obtained by the Inverse FFT (IFFT), whose peak is predicted as the new position of the target. Lastly, the appearance at the newly estimated
position is extracted for training and updating the correlation filter with a desired output.


A. Traditional Training Methods 

In the simplest case, a template cropped from an image can be used
to produce peaks for the target. However, its responses to
background patterns are also relatively high. To overcome this
issue, a variety of correlation filters were trained by suppressing
responses to negative training samples while maintaining a high
response to the target. These filters differ mainly in how they are
constructed from the collected training samples. For example,
Synthetic Discriminant Functions (SDF) [27], [32], Optimal Tradeoff
Filters (OTF) and Minimum Average Correlation Energy (MACE) [29] are
trained with hard constraints so that peaks are always produced at
the same height. In contrast, hard constraints are considered
unnecessary in other filters, such as Maximum Average Correlation
Height (MACH) and Unconstrained MACE (UMACE) [31], which are trained
by relaxing the hard constraints. A dedicated survey covers the
development of these correlation filters in more detail. More
recently, a correlation filter named Average of Synthetic Exact
Filters (ASEF) averages all the trained exact filters to obtain a
general one. The resulting filter has been shown to perform well in
eye localization [34] and pedestrian detection [35]. Although ASEF
may be robust enough for visual tracking, it requires a large number
of training samples, which makes it too slow for online tasks.


B. Adaptive Correlation Filters 

To train correlation filters more efficiently, a novel filter termed
Minimum Output Sum of Squared Error (MOSSE) was developed by Bolme
et al. [15], together with an improved version of ASEF.

1) MOSSE: According to (2) and (3), a simple filter can be obtained
from a sample x with the corresponding desired output y. However,
more samples are needed to improve the robustness of correlation
filters. To properly map these input samples to the desired outputs,
MOSSE finds a filter h by minimizing the sum of squared errors
between the actual correlation outputs and the desired correlation
outputs. Computed in the frequency domain, this minimization problem
can be expressed as:

$$\min_{\hat{\mathbf{h}}^{*}} \sum_i \left\| \hat{\mathbf{x}}_i \odot \hat{\mathbf{h}}^{*} - \hat{\mathbf{y}}_i \right\|^2 \tag{4}$$

where i indexes each training image. The solution for
$\hat{\mathbf{h}}^{*}$ is then given by:

$$\hat{\mathbf{h}}^{*} = \frac{\sum_i \hat{\mathbf{y}}_i \odot \hat{\mathbf{x}}_i^{*}}{\sum_i \hat{\mathbf{x}}_i \odot \hat{\mathbf{x}}_i^{*}} \tag{5}$$

whose detailed derivation can be found in [15].

In general, the desired output y can take any shape. In MOSSE, it is
generated from the ground truth as a compact 2D Gaussian-shaped
distribution whose peak is at the center. If a Kronecker delta
function is used to define y, with a value of one at the target
center and zero elsewhere, the resulting filter is theoretically the
UMACE filter mentioned above. Thus UMACE is a special case of MOSSE.
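The closed form (5) can be sketched in a few lines of NumPy. This is a simplified illustration rather than the reference MOSSE implementation: the helper names, the Gaussian width, the small stabilizer `eps` in the denominator, and the use of circularly shifted copies of one random patch as training samples are all assumptions made for the example.

```python
import numpy as np

def gaussian_target(shape, sigma=2.0):
    """Desired output y: a 2-D Gaussian peaked at the patch centre."""
    rows = np.arange(shape[0]) - shape[0] // 2
    cols = np.arange(shape[1]) - shape[1] // 2
    rr, cc = np.meshgrid(rows, cols, indexing="ij")
    return np.exp(-(rr**2 + cc**2) / (2.0 * sigma**2))

def train_mosse(samples, targets, eps=1e-5):
    """Closed form of (5); returns the conjugate filter in the Fourier domain.

    eps is a small stabiliser added to the denominator (an assumption here).
    """
    num = 0
    den = 0
    for x, y in zip(samples, targets):
        x_hat = np.fft.fft2(x)
        y_hat = np.fft.fft2(y)
        num = num + y_hat * np.conj(x_hat)
        den = den + x_hat * np.conj(x_hat)
    return num / (den + eps)

def respond(h_conj_hat, x):
    """Correlation output F^-1(x_hat * h_conj_hat) for a new patch x."""
    return np.real(np.fft.ifft2(np.fft.fft2(x) * h_conj_hat))

# Toy data: shifted copies of one patch, each paired with an equally shifted target.
rng = np.random.default_rng(1)
base = rng.standard_normal((32, 32))
y = gaussian_target(base.shape)
shifts = [(0, 0), (1, 2), (-2, 1)]
samples = [np.roll(base, s, axis=(0, 1)) for s in shifts]
targets = [np.roll(y, s, axis=(0, 1)) for s in shifts]
h_conj_hat = train_mosse(samples, targets)

# A patch displaced by (4, -3) moves the response peak away from the centre (16, 16).
resp = respond(h_conj_hat, np.roll(base, (4, -3), axis=(0, 1)))
peak = tuple(int(i) for i in np.unravel_index(np.argmax(resp), resp.shape))
```

Replacing `gaussian_target` with a single one at the centre and zeros elsewhere gives the Kronecker-delta case discussed above.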

2) Regularized ASEF: By slightly modifying its original form, ASEF
is also capable of efficient tracking. Using one sample at a time, a
filter called the exact filter can be found by solving (4):

$$\hat{\mathbf{h}}_i^{*} = \frac{\hat{\mathbf{y}}_i \odot \hat{\mathbf{x}}_i^{*}}{\hat{\mathbf{x}}_i \odot \hat{\mathbf{x}}_i^{*}} \tag{6}$$

Then a more general filter can be produced by averaging all the
computed exact filters:

$$\hat{\mathbf{h}}^{*} = \frac{1}{N} \sum_i \frac{\hat{\mathbf{y}}_i \odot \hat{\mathbf{x}}_i^{*}}{\hat{\mathbf{x}}_i \odot \hat{\mathbf{x}}_i^{*}} \tag{7}$$

However, the original ASEF can be very unstable because the
denominator in (7) may be extremely small. To produce a more stable
filter, a regularization parameter $\epsilon$ can be added to the
denominator to keep it away from zero, which has been shown to be
effective for stabilization.
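A minimal sketch of the regularized average of exact filters follows. The function name `train_asef`, the value of `eps`, and the two random training patches are assumptions chosen for illustration; a Kronecker-delta target at the patch center is used to keep the example short.

```python
import numpy as np

def train_asef(samples, targets, eps=1.0):
    """Average of regularised exact filters; returns h* in the Fourier domain.

    eps keeps each per-sample denominator away from zero.
    """
    exact_filters = []
    for x, y in zip(samples, targets):
        x_hat = np.fft.fft2(x)
        y_hat = np.fft.fft2(y)
        exact_filters.append(y_hat * np.conj(x_hat) / (x_hat * np.conj(x_hat) + eps))
    return np.mean(exact_filters, axis=0)

# Two unrelated toy patches, each asked to produce a delta peak at the centre.
rng = np.random.default_rng(2)
size = 32
delta = np.zeros((size, size))
delta[size // 2, size // 2] = 1.0
samples = [rng.standard_normal((size, size)) for _ in range(2)]
h_conj_hat = train_asef(samples, [delta, delta], eps=1.0)

# The averaged filter still responds most strongly at the centre of a training patch.
resp = np.real(np.fft.ifft2(np.fft.fft2(samples[0]) * h_conj_hat))
peak = tuple(int(i) for i in np.unravel_index(np.argmax(resp), resp.shape))
```

With very small `eps`, frequencies where a sample has almost no energy dominate the average, which is the instability described above.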


C. Kernelized Correlation Filters 

After the success of these adaptive filters, the correlation
filter-based tracking framework has proven to be highly efficient
for robust tracking. However, the overall performance may be limited
because the ASEF and MOSSE filters can be viewed as simple linear
classifiers. By taking advantage of the kernel trick, correlation
filters are expected to become more powerful.

There have already been some studies [36]-[38] that apply kernel
methods to correlation filters. According to [36], [37], filters
that do not use the power spectrum or image translations are easier
to kernelize. Different from these studies, Henriques et al. [16],
[20] proposed that correlation filters can be effectively kernelized
by introducing the Ridge Regression problem and circulant matrices.

1) Ridge Regression Problem: By considering correlation filters as
classifiers, they can be trained by finding the relation between the
i-th input $\mathbf{x}_i$ and its label $y_i$ from a training set.
Suppose the relation takes the form $f(\mathbf{x}_i) = y_i$; the
training problem can then be viewed as minimizing the objective
function:

$$\min_{\mathbf{w}} \sum_i L\left(f(\mathbf{w}, \mathbf{x}_i), y_i\right) + \lambda \|\mathbf{w}\|^2 \tag{8}$$

where w denotes the parameters, $\lambda$ is the regularization
parameter that prevents overfitting, and $L(\cdot)$ is the loss
function. In SVM, $L(\cdot)$ is defined by the hinge loss
$L(f(\mathbf{w}, \mathbf{x}_i), y_i) = \max(0, 1 - y_i f(\mathbf{w}, \mathbf{x}_i))$,
while Regularized Least Squares (RLS), which uses the quadratic loss
$L(f(\mathbf{w}, \mathbf{x}_i), y_i) = (y_i - f(\mathbf{w}, \mathbf{x}_i))^2$,
can alternatively be applied for training filters. It has been shown
that training by RLS can deliver performance equivalent to that of
the hinge loss [39]. RLS is also known as Ridge Regression.

The function $f(\mathbf{x}_i)$ can be a linear operation
$f(\mathbf{x}_i) = \langle \mathbf{w}, \mathbf{x}_i \rangle + b$,
where $\langle \cdot, \cdot \rangle$ is the dot product and b is a
constant offset. By solving (8), the parameter w can be given in
closed form [39]:

$$\mathbf{w} = \left(X^{T} X + \lambda I\right)^{-1} X^{T} \mathbf{y} \tag{9}$$

where X is a matrix whose rows are training samples, y is a vector
of the corresponding labels, and I is the identity matrix. It is
worth noting that if the computation is performed in the frequency
domain, $X^{T}$ in (9) should be replaced by the Hermitian transpose
of X, that is, $X^{H} = (X^{*})^{T}$.


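To introduce kernel functions for improving performance, the input
data x can be mapped to a non-linear feature space with
$\varphi(\mathbf{x})$, and w can be expressed as a linear
combination of the inputs,
$\mathbf{w} = \sum_i \alpha_i \varphi(\mathbf{x}_i)$. Then
$f(\mathbf{x}_j)$ takes the form:

$$f(\mathbf{x}_j) = \sum_{i=1}^{n} \alpha_i \kappa(\mathbf{x}_i, \mathbf{x}_j) \tag{10}$$

where $\kappa(\mathbf{x}_i, \mathbf{x}_j) = \langle \varphi(\mathbf{x}_i), \varphi(\mathbf{x}_j) \rangle$
is the kernel function. Suppose K is the kernel matrix with elements
$K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$. The solution of (8)
using kernel functions is given by:

$$\boldsymbol{\alpha} = \left(K + \lambda I\right)^{-1} \mathbf{y} \tag{11}$$

where I is the identity matrix. To avoid the difficulty of computing
the inverse matrix in (11), circulant matrices can be introduced.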
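The link between the primal solution (9) and the dual solution (11) can be checked numerically for the linear kernel, where $K = X X^{T}$ and $\mathbf{w} = X^{T}\boldsymbol{\alpha}$. The sketch below uses random data and hypothetical helper names; it only illustrates the algebra and is not part of any tracker.

```python
import numpy as np

def ridge_primal(X, y, lam):
    """Closed-form ridge regression of (9): w = (X^T X + lam I)^-1 X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def ridge_dual(K, y, lam):
    """Dual coefficients of (11): alpha = (K + lam I)^-1 y."""
    n = K.shape[0]
    return np.linalg.solve(K + lam * np.eye(n), y)

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))   # 20 training samples (rows), 5 features each
y = rng.standard_normal(20)        # one label per sample
lam = 0.1

w = ridge_primal(X, y, lam)
alpha = ridge_dual(X @ X.T, y, lam)  # linear kernel: K_ij = <x_i, x_j>
w_from_dual = X.T @ alpha            # w as a linear combination of the samples
```

Both routes give the same weights up to numerical precision. The dual form needs the inverse of an n x n matrix, which is the cost that the circulant structure is meant to remove.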

2) Circulant Matrix: Generally, samples are obtained by random
sampling. With the help of circulant matrices, however, all the
translated samples around the target can be collected for training
without sacrificing much speed.

With a base sample $\mathbf{x} = (x_0, \ldots, x_{n-1})$, a
circulant matrix X has the following form:

$$X = C(\mathbf{x}) = \begin{pmatrix} x_0 & x_1 & \cdots & x_{n-1} \\ x_{n-1} & x_0 & \cdots & x_{n-2} \\ \vdots & \vdots & \ddots & \vdots \\ x_1 & x_2 & \cdots & x_0 \end{pmatrix} \tag{12}$$


Circulant matrices have various interesting properties. For example,
their sums, products and inverses are also circulant. In addition, a
circulant matrix can be made diagonal with the DFT of its base
vector x:

$$X = F\,\mathrm{diag}(\hat{\mathbf{x}})\,F^{H} \tag{13}$$

where F is the DFT matrix, which computes the DFT of a vector as
$\mathcal{F}(\mathbf{z}) = \sqrt{n}\,F\mathbf{z}$. Then the solution
for w can be expressed in the form:

$$\mathbf{w} = F\,\mathrm{diag}\!\left(\frac{\hat{\mathbf{x}}^{*}}{\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}} + \lambda}\right) F^{H}\mathbf{y} \tag{14}$$


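which is equivalent to a simpler form in the frequency domain:

$$\hat{\mathbf{w}} = \frac{\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{y}}}{\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}} + \lambda} \tag{15}$$

where the division is performed element-wise. Similarly,
$\boldsymbol{\alpha}$ can also be computed efficiently if the kernel
matrix K is circulant:

$$\boldsymbol{\alpha} = F\,\mathrm{diag}\!\left(\hat{\mathbf{k}} + \lambda\right)^{-1} F^{H}\mathbf{y} \tag{16}$$

where k is the base vector of the circulant matrix K, and further:

$$\hat{\boldsymbol{\alpha}} = \frac{\hat{\mathbf{y}}}{\hat{\mathbf{k}} + \lambda} \tag{17}$$

where the division is also element-wise.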

It has been proven that the kernel function of a circulant kernel
matrix should be unitarily invariant (a detailed proof can be found
in [16], [20]). Since dot-product and radial basis kernel functions
satisfy this condition, polynomial kernels and Gaussian kernels are
usually applied.
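To make the kernelized pipeline concrete, the sketch below trains the dual coefficients with (17) and evaluates a response map for a shifted sample, using a Gaussian kernel evaluated with FFTs over all cyclic shifts as this subsection describes. It is a simplified single-channel illustration: the function names, the parameter values, the label map peaked at index (0, 0), and the normalization of the squared distance by the number of pixels are assumptions of this example.

```python
import numpy as np

def gaussian_labels(shape, sigma=2.0):
    """Desired output with its peak at index (0, 0), wrapping around the borders."""
    r = np.minimum(np.arange(shape[0]), shape[0] - np.arange(shape[0]))
    c = np.minimum(np.arange(shape[1]), shape[1] - np.arange(shape[1]))
    rr, cc = np.meshgrid(r, c, indexing="ij")
    return np.exp(-(rr**2 + cc**2) / (2.0 * sigma**2))

def gaussian_kernel_correlation(x, z, sigma):
    """Gaussian kernel between x and every cyclic shift of z, computed with FFTs."""
    cross = np.real(np.fft.ifft2(np.conj(np.fft.fft2(x)) * np.fft.fft2(z)))
    dist2 = np.sum(x**2) + np.sum(z**2) - 2.0 * cross
    # Dividing by x.size keeps sigma independent of the patch size (assumption).
    return np.exp(-np.maximum(dist2, 0.0) / (sigma**2 * x.size))

def train_kcf(x, y, sigma=0.5, lam=1e-4):
    """Dual coefficients in the Fourier domain: alpha_hat = y_hat / (k_hat + lam)."""
    k = gaussian_kernel_correlation(x, x, sigma)
    return np.fft.fft2(y) / (np.fft.fft2(k) + lam)

def detect_kcf(alpha_hat, x, z, sigma=0.5):
    """Response map for a new sample z given the stored base sample x."""
    k = gaussian_kernel_correlation(x, z, sigma)
    return np.real(np.fft.ifft2(np.fft.fft2(k) * alpha_hat))

rng = np.random.default_rng(3)
x = rng.standard_normal((32, 32))      # base sample
y = gaussian_labels(x.shape)
alpha_hat = train_kcf(x, y)
z = np.roll(x, (4, 6), axis=(0, 1))    # new sample: target displaced by (4, 6)
resp = detect_kcf(alpha_hat, x, z)
peak = tuple(int(i) for i in np.unravel_index(np.argmax(resp), resp.shape))
```

Because the labels peak at index (0, 0), the location of the response maximum can be read directly as the displacement of the target.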
If the kernel k is computed between x and x', a polynomial kernel
$\kappa(\mathbf{x}, \mathbf{x}') = (\mathbf{x}^{T}\mathbf{x}' + a)^{b}$
can be expressed as:

$$\mathbf{k}^{\mathbf{x}\mathbf{x}'} = \left(\mathcal{F}^{-1}\left(\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}'\right) + a\right)^{b} \tag{18}$$

and the Gaussian kernel
$\kappa(\mathbf{x}, \mathbf{x}') = \exp\left(-\frac{1}{\sigma^2}\|\mathbf{x} - \mathbf{x}'\|^2\right)$
can be computed by:

$$\mathbf{k}^{\mathbf{x}\mathbf{x}'} = \exp\left(-\frac{1}{\sigma^2}\left(\|\mathbf{x}\|^2 + \|\mathbf{x}'\|^2 - 2\mathcal{F}^{-1}\left(\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}'\right)\right)\right) \tag{19}$$

The derivations of the equations in this subsection can be found in
[16], [20].

3) Detection: In a new frame, the target can be detected with the
trained parameter $\boldsymbol{\alpha}$ and a maintained base sample
x. If the new sample is z, a confidence map y can be obtained by:

$$\mathbf{y} = C(\mathbf{k}^{\mathbf{x}\mathbf{z}})\,\boldsymbol{\alpha} = \mathcal{F}^{-1}\left(\hat{\mathbf{k}}^{\mathbf{x}\mathbf{z}} \odot \hat{\boldsymbol{\alpha}}\right) \tag{20}$$

Similar to ASEF and MOSSE, the position with the maximum value in y
is predicted as the new position of the target.


D. Dense Spatio-Temporal Context Tracker

The Spatio-Temporal Context (STC) tracker proposed in [26] was
developed to exploit context information. We consider it another CFT
since it follows a workflow similar to the one described in
Section II.

Context information has already been considered in various trackers
[45]-[48]. In the majority of these studies, key points around the
target are first extracted, and then descriptors such as SURF and
SIFT are introduced to describe these consistent regions. However,
crucial information can sometimes be ignored by these methods, and
they are also quite time-consuming. The fundamental goal of STC is
therefore to use context information more efficiently.

Instead of training by optimization, STC is designed to learn a
likelihood distribution, which is defined as the prior probability
of the object being located at position p
($\mathbf{p} \in \mathbb{R}^2$):

$$\ell(\mathbf{p}) = P(\mathbf{p} \mid o) \tag{21}$$

where $\ell(\cdot)$ denotes the likelihood and o is the object
present in the scene.

Let $\mathbf{p}_o$ denote the position of the target's center, and
let $\Omega_c(\mathbf{p}_o)$ denote the neighboring coordinates
around $\mathbf{p}_o$. Then a context feature set can be defined by
$P^{c} = \{c(\mathbf{p}') = (I(\mathbf{p}'), \mathbf{p}') \mid \mathbf{p}' \in \Omega_c(\mathbf{p}_o)\}$,
where $I(\mathbf{p}')$ represents the image intensity at position
$\mathbf{p}'$. Marginalizing the joint distribution over the context
features $c(\mathbf{p}')$ given o yields:

$$\ell(\mathbf{p}) = P(\mathbf{p} \mid o) = \sum_{c(\mathbf{p}') \in P^{c}} P(\mathbf{p}, c(\mathbf{p}') \mid o) = \sum_{c(\mathbf{p}') \in P^{c}} P(\mathbf{p} \mid c(\mathbf{p}'), o)\,P(c(\mathbf{p}') \mid o) \tag{22}$$

where $P(\mathbf{p} \mid c(\mathbf{p}'), o)$ models the
relationship between the spatial context information and the target
location, and $P(c(\mathbf{p}') \mid o)$ models the appearance of
the object.

Since there is no direct expression for
$P(\mathbf{p} \mid c(\mathbf{p}'), o)$, let us define a function to
describe it:

$$P(\mathbf{p} \mid c(\mathbf{p}'), o) = h(\mathbf{p} - \mathbf{p}') \tag{23}$$

where h is an operation that takes the difference of the two vectors
p and p' as its input. To relieve the ambiguities caused by similar
objects in the neighborhood, h should not be radially symmetric. In
other words, $h(\mathbf{p} - \mathbf{p}')$ and
$h(|\mathbf{p} - \mathbf{p}'|)$ should not be equal to each other.

$P(c(\mathbf{p}') \mid o)$ can be defined as:

$$P(c(\mathbf{p}') \mid o) = I(\mathbf{p}')\,w_{\sigma}(\mathbf{p}' - \mathbf{p}_o) \tag{24}$$

where $I(\cdot)$ is the image intensity and $w_{\sigma}(\cdot)$
denotes a weighted Gaussian function defined by:

$$w_{\sigma}(\mathbf{p}' - \mathbf{p}_o) = a \exp\left(-\frac{1}{\sigma^2}\|\mathbf{p}' - \mathbf{p}_o\|^2\right) \tag{25}$$

where a is a normalization parameter. Given these
Gaussian-distributed weights, contexts closer to the center of the
object are assigned larger values while farther contexts are
assigned smaller values. The tracker therefore pays more attention
to the central area.

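For training $h(\mathbf{p} - \mathbf{p}')$, a desired output
distribution $\ell(\mathbf{p})$ can be designed by hand. If the
object is known to be at the center of the scene,
$\ell(\mathbf{p})$ can be defined by:

$$\ell(\mathbf{p}) = P(\mathbf{p} \mid o) = b \exp\left(-\left\|\frac{\mathbf{p} - \mathbf{p}_o}{\alpha}\right\|^{\beta}\right) \tag{26}$$

where b is also a normalization parameter, $\alpha$ is a scale
parameter and $\beta$ controls the shape of this distribution.
Subsequently, we have:

$$\ell(\mathbf{p}) = b \exp\left(-\left\|\frac{\mathbf{p} - \mathbf{p}_o}{\alpha}\right\|^{\beta}\right) = \sum_{c(\mathbf{p}') \in P^{c}} h(\mathbf{p} - \mathbf{p}')\,I(\mathbf{p}')\,w_{\sigma}(\mathbf{p}' - \mathbf{p}_o) = h(\mathbf{p}) \otimes \left(I(\mathbf{p})\,w_{\sigma}(\mathbf{p} - \mathbf{p}_o)\right) \tag{27}$$

where the sum is a convolution of h with the weighted intensity. By
introducing the Convolution Theorem, we have:

$$h(\mathbf{p}) = \mathcal{F}^{-1}\left(\frac{\mathcal{F}\left(b \exp\left(-\left\|\frac{\mathbf{p} - \mathbf{p}_o}{\alpha}\right\|^{\beta}\right)\right)}{\mathcal{F}\left(I(\mathbf{p})\,w_{\sigma}(\mathbf{p} - \mathbf{p}_o)\right)}\right) \tag{28}$$

where the division is performed element-wise. With a trained
$h(\mathbf{p})$, $\ell(\mathbf{p})$ of the new frame can be
calculated by:

$$\ell^{t}(\mathbf{p}) = \mathcal{F}^{-1}\left(\mathcal{F}\left(h^{t}(\mathbf{p})\right) \odot \mathcal{F}\left(I^{t}(\mathbf{p})\,w_{\sigma}(\mathbf{p} - \mathbf{p}_o^{t-1})\right)\right) \tag{29}$$

where t represents the current time step.

Similarly, the position p with the maximum value in
$\ell(\mathbf{p})$ can be viewed as the new position of the object.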
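The learning and detection steps of STC can be sketched as a deconvolution followed by a convolution in the Fourier domain. This is an illustrative simplification under several assumptions: the normalization constants a and b are set to one, a small `eps` regularizes the element-wise division (the plain division of (28) corresponds to `eps = 0`), the parameter values are arbitrary, and the "new frame" is simulated by circularly shifting the weighted context so that the expected peak is known exactly.

```python
import numpy as np

def context_prior(image, center, sigma):
    """I(p) * w_sigma(p - p_o): intensity weighted by a Gaussian around the target."""
    rows, cols = np.indices(image.shape)
    dist2 = (rows - center[0]) ** 2 + (cols - center[1]) ** 2
    return image * np.exp(-dist2 / sigma**2)

def desired_confidence(shape, center, alpha=2.25, beta=1.0):
    """Hand-designed confidence map exp(-|(p - p_o)/alpha|^beta), with b = 1."""
    rows, cols = np.indices(shape)
    dist = np.sqrt((rows - center[0]) ** 2 + (cols - center[1]) ** 2)
    return np.exp(-((dist / alpha) ** beta))

def learn_stc(prior, confidence, eps=1e-3):
    """Spatial context model h in the Fourier domain (regularised division)."""
    prior_hat = np.fft.fft2(prior)
    return np.fft.fft2(confidence) * np.conj(prior_hat) / (np.abs(prior_hat) ** 2 + eps)

def stc_confidence(h_hat, prior):
    """Confidence map of a frame: inverse FFT of h_hat times the context prior."""
    return np.real(np.fft.ifft2(h_hat * np.fft.fft2(prior)))

rng = np.random.default_rng(4)
image = rng.random((64, 64)) + 0.5          # toy intensities
center = (32, 32)
prior = context_prior(image, center, sigma=16.0)
conf = desired_confidence(image.shape, center)
h_hat = learn_stc(prior, conf)

# Idealised motion: the whole weighted context moves by (3, -2).
moved_prior = np.roll(prior, (3, -2), axis=(0, 1))
new_conf = stc_confidence(h_hat, moved_prior)
peak = tuple(int(i) for i in np.unravel_index(np.argmax(new_conf), new_conf.shape))
```

In an actual tracker the Gaussian weighting stays centered on the previous target position while the image content moves, so the peak is only approximately displaced by the target motion.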


E. Updating Scheme 

With the training schemes introduced above, each frame can produce a
correlation filter. The strategy for combining it with the existing
trained filter is therefore crucial for constructing a robust
appearance model.
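One common combining strategy is an exponentially weighted running average with a learning rate. The sketch below applies it to a MOSSE-style filter by averaging the numerator and the denominator separately. The function names, the default learning rate and the stabilizer `eps` are assumptions made for this illustration.

```python
import numpy as np

def init_state(x, y):
    """Numerator A and denominator B computed from the first sample."""
    x_hat, y_hat = np.fft.fft2(x), np.fft.fft2(y)
    return y_hat * np.conj(x_hat), x_hat * np.conj(x_hat)

def update_state(A, B, x, y, eta=0.125):
    """Blend the statistics of a new sample into A and B with learning rate eta."""
    x_hat, y_hat = np.fft.fft2(x), np.fft.fft2(y)
    A = eta * (y_hat * np.conj(x_hat)) + (1.0 - eta) * A
    B = eta * (x_hat * np.conj(x_hat)) + (1.0 - eta) * B
    return A, B

def current_filter(A, B, eps=1e-5):
    """Conjugate filter in the Fourier domain; eps guards against division by zero."""
    return A / (B + eps)

rng = np.random.default_rng(5)
x1 = rng.standard_normal((16, 16))   # appearance in the first frame
x2 = rng.standard_normal((16, 16))   # appearance in a later frame
y = np.zeros((16, 16))
y[8, 8] = 1.0                        # desired output peaked at the centre

A, B = init_state(x1, y)
A, B = update_state(A, B, x2, y)
h_conj_hat = current_filter(A, B)
```

A larger `eta` adapts faster to appearance changes but also forgets the earlier appearance sooner, which matters when the target is occluded.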
In CFTs, a running average is usually applied for updating, though
different algorithms may average over different components. For
regularized ASEF, a general correlation filter is updated by
averaging every learned exact filter:

$$\hat{\mathbf{h}}_t^{*} = \eta\,\frac{\hat{\mathbf{y}}_t \odot \hat{\mathbf{x}}_t^{*}}{\hat{\mathbf{x}}_t \odot \hat{\mathbf{x}}_t^{*} + \epsilon} + (1 - \eta)\,\hat{\mathbf{h}}_{t-1}^{*} \tag{30}$$

where t denotes the t-th frame and $\eta$ is the learning rate. STC
also updates its filter based on the form of (30).

Instead, MOSSE averages the numerator and the denominator of (5)
separately:

$$\hat{\mathbf{h}}_t^{*} = \frac{A_t}{B_t}, \qquad A_t = \eta\left(\hat{\mathbf{y}}_t \odot \hat{\mathbf{x}}_t^{*}\right) + (1 - \eta)A_{t-1}, \qquad B_t = \eta\left(\hat{\mathbf{x}}_t \odot \hat{\mathbf{x}}_t^{*}\right) + (1 - \eta)B_{t-1} \tag{31}$$

For KCF, the dual space coefficients $\boldsymbol{\alpha}$ can be
updated in the frequency domain:

$$\hat{\boldsymbol{\alpha}}_t = \eta\,\frac{\hat{\mathbf{y}}}{\hat{\mathbf{k}}_t + \lambda} + (1 - \eta)\,\hat{\boldsymbol{\alpha}}_{t-1} \tag{32}$$

whose $\mathbf{k}_t$ is averaged by:

$$\mathbf{k}_t = \eta\,\mathbf{k}^{\mathbf{z}\mathbf{z}} + (1 - \eta)\,\mathbf{k}_{t-1} \tag{33}$$

where z is the new sample extracted from the currently predicted
position.

Generally, CFTs use updating schemes similar to those described
above, and slight modifications are sometimes made to improve
performance. For example, Danelljan et al. [17] modified the
updating scheme of the CSK tracker [16] (the original version of
KCF) by training with a cost function based on a weighted average
quadratic error. Moreover, robust updating schemes can also be
achieved by considering long-term tracking. If the target is lost or
occluded, continuing to learn the appearance model is obviously
harmful. To avoid learning from false positive samples, some studies
have introduced long-term components with failure detection schemes.
For instance, one tracker stops updating if occlusions are detected,
and the tracker of [22] refreshes the correlation filter if the
prediction of the long-term component is more confident. Experiments
have shown that the detection of occlusions is extremely beneficial.

F. Comparisons of Different Training Schemes

The training schemes discussed in this section include ASEF, MOSSE,
Kernelized Correlation Filter (KCF) and STC. In general, they all
follow the workflow described in Section II, where computations
based on the Convolution Theorem are employed for detection and the
filter is trained with a desired output. However, there are some
differences among these CFTs.

For the ASEF tracker, the filter is produced by averaging over all
the learned filters, while the MOSSE filter is trained by averaging
over all the images. By introducing the Ridge Regression problem and
circulant matrices, kernelized correlation filters can be used for
tracking. Theoretically, the linear kernel in KCF can be the same as
the MOSSE filter if multiple single-channel samples are used for
training. If multiple channels are supported, the linear kernel,
which is called the Dual Correlation Filter (DCF) [20], can be
trained using only a single sample. The general case, which uses
several multi-channel samples to train filters, is computationally
expensive and therefore unsuitable for online visual tracking.

The differences between STC and the other training schemes
introduced here include the following aspects. First, STC is
developed to model the relationships between the object and its
local spatial contexts, while common CFTs model the input appearance
with trained filters. Second, values of the confidence map in STC
can be regarded as prior probabilities given the current object,
while values in the confidence maps of other CFTs are correlation
scores. Third, the STC algorithm is able to estimate scale
variations, which is difficult for CFTs like MOSSE and KCF. More
arguments can be found in [26].

IV. Further Improvements

Beyond proposing novel training schemes for correlation filters,
there are various aspects that can improve the robustness of CFTs.
Over the years, improvements have mainly been made in feature
representation, handling scale variations, applying part-based
strategies and cooperating with long-term tracking.

A. Feature Representation 

In earlier CFTs like MOSSE and CSK, raw pixels are used directly for
tracking. However, the noise in raw images severely limits tracking
performance. Although shifting the input data to a zero-mean
distribution or multiplying it with a Hanning window can slightly
reduce the effect of this noise, more powerful features are still
needed for further improvements.

Features with multiple channels can be more representative and
informative. In KCF, integrating them is simple and efficient. For
the Gaussian kernel function, vectors from different channels can
simply be added together:

$$\mathbf{k}^{\mathbf{x}\mathbf{x}'} = \exp\left(-\frac{1}{\sigma^2}\left(\|\mathbf{x}\|^2 + \|\mathbf{x}'\|^2 - 2\mathcal{F}^{-1}\left(\sum_c \hat{\mathbf{x}}_c^{*} \odot \hat{\mathbf{x}}'_c\right)\right)\right) \tag{34}$$

where c indexes the channels. With multi-channel kernel functions,
the well-known HOG feature has been successfully applied in KCF
trackers with superior performance.
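The multi-channel Gaussian kernel can be sketched by summing the per-channel cross-spectra before a single inverse FFT. The function name, the channel-first array layout, and the normalization of the squared distance by the total number of elements are assumptions of this example; random arrays stand in for real features such as HOG.

```python
import numpy as np

def multichannel_gaussian_correlation(x, z, sigma):
    """Gaussian kernel map for features of shape (channels, height, width)."""
    x_hat = np.fft.fft2(x, axes=(-2, -1))
    z_hat = np.fft.fft2(z, axes=(-2, -1))
    # Sum the per-channel cross-spectra, then apply one inverse FFT.
    cross = np.real(np.fft.ifft2(np.sum(np.conj(x_hat) * z_hat, axis=0)))
    dist2 = np.sum(x**2) + np.sum(z**2) - 2.0 * cross
    # Dividing by x.size keeps sigma independent of the feature size (assumption).
    return np.exp(-np.maximum(dist2, 0.0) / (sigma**2 * x.size))

rng = np.random.default_rng(6)
x = rng.standard_normal((3, 16, 16))          # three toy feature channels
z = np.roll(x, (2, 5), axis=(1, 2))           # same features, spatially shifted
k = multichannel_gaussian_correlation(x, z, sigma=0.5)
peak = tuple(int(i) for i in np.unravel_index(np.argmax(k), k.shape))
```

The cost grows only linearly with the number of channels, since each channel adds one forward FFT and the inverse FFT is shared.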

Besides HOG, color attributes are also considered beneficial [17].
Color attributes [50], or Color Names (CN), are the names of
different colors defined by humans. In English, it has been
concluded that there are 11 basic color terms [50], which include
white, black, blue and so on. A map between RGB combinations and
linguistic color attributes can be found in [50], which is trained
with images retrieved from Google image search. Using the map, RGB
values can be associated with a probabilistic 11-dimensional color
vector with unit length. By further proposing an adaptive
dimensionality reduction technique, the resulting tracker can
achieve state-of-the-art accuracy at a considerably high speed of
over 100 FPS.
To some extent, the two features are complementary to each other.
The HOG feature mainly analyzes image gradients, while the CN
feature focuses on color representations. Thanks to the efficient
integration of multi-channel data in KCF [20], the HOG feature and
the CN feature can be fused together to facilitate robust tracking
[18].

B. Handling Scale Variations 

Conventional CFTs, such as MOSSE and KCF, mainly employ fixed-size
windows for tracking, and they are unable to deal with changes in
target scale. To handle scale variations, numerous algorithms have
been proposed.

In SAMF and DSST, a searching strategy is applied to estimate the
scale of the target. Specifically, windows of different sizes are
sampled around the target and correlated with the learned filter.
Subsequently, the window with the highest correlation score is
predicted as the new state.

Suppose the size of a window i is denoted by a 2-dimensional vector
$\mathbf{s}_i$. Let $\mathbf{s}_0$ denote a template window size;
then we have $\mathbf{s}_i = a_i \mathbf{s}_0$, where $a_i$ denotes
a scale parameter given by a scaling pool
$S = \{a_1, a_2, \ldots, a_N\}$ of N positive numbers. As shown in
Figure 2, the size of the current target can be estimated by
searching for the window with the maximum correlation score among
the sampled windows. In SAMF, S is set to constant values ranging
from 0.985 to 1.015, and in DSST,
$S = \{a^{n} \mid n = \lfloor -\tfrac{N-1}{2} \rfloor, \ldots, \lfloor \tfrac{N-1}{2} \rfloor\}$.
The major difference between the two trackers is that SAMF processes
one window at a time, while a 3-dimensional correlation filter is
employed to search for the best scale in DSST. Some recent CFTs,
[22] and [23], also apply the scaling pool method.


Fig. 2. Workflow of estimating scales using the searching strategy. Each time
a new frame comes, windows of different sizes are cropped around the previous
position. By correlating them with a trained correlation filter h, the corresponding
confidence maps $m_i$ ($i = 1, \ldots, N$) can be obtained. Then the window that
produces the maximum confidence score is estimated as the new scale.

Besides using a scaling pool, scale variations can also be
handled by some part-based tracking methods. In [24], the size of
the whole target is determined through a Bayesian framework. In
another part-based tracker [25], a statistical method is
employed, which records and sorts the variations of relative
positions between different sub-patches to estimate scales. In
practice, the performance of these methods is also promising.

In addition, it is worth mentioning that STC has its own
scheme to deal with scale variations. Following the formulas
used in Section III-D, suppose the newly estimated center of the
target is $p_o$, and $\ell(p_o)$ is its computed confidence score.
Then scales can be estimated by:

$$ s'_t = \sqrt{\frac{\ell((p_o)_t)}{\ell((p_o)_{t-1})}} \tag{35} $$


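where $s'_t$ is the predicted scale at time $t$. To smooth the
predictions, estimated scales are averaged over $n$ consecutive
frames, and linear interpolation is used for prediction:

$$ \begin{cases} \bar{s}_t = \frac{1}{n} \sum_{i=1}^{n} s'_{t-i} \\ s_{t+1} = (1 - \lambda) s_t + \lambda \bar{s}_t \end{cases} \tag{36} $$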


where $\lambda$ is a fixed parameter. With the estimated size of
the target, the parameter $\sigma$ of the weight function (25)
also needs to be updated:

$$ \sigma_{t+1} = s_t \sigma_t \tag{37} $$


While conducting the experiments, we found that the estimation
in STC is sometimes unstable, because the computed scale ratio
$s'_t$ can be extremely large if the denominator of (35) is close
to zero.
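
The update in (35)-(37), together with the instability noted above, can be sketched as follows. This is a minimal sketch and not the STC reference code: the function names are hypothetical, and the `eps` guard on the denominator is an added assumption that the original scheme does not include.

```python
import math


def scale_ratio(conf_t, conf_prev, eps=1e-12):
    """Eq. (35): square root of the ratio of consecutive confidence scores.

    The eps floor on the denominator is an assumption added here to
    avoid the blow-up when the previous confidence is close to zero.
    """
    return math.sqrt(conf_t / max(conf_prev, eps))


def update_scale(scale, sigma, recent_ratios, lam):
    """Eqs. (36)-(37): smooth the scale and rescale the weight parameter.

    recent_ratios holds the ratios from the last n frames, lam is the
    fixed interpolation parameter.
    """
    mean_ratio = sum(recent_ratios) / len(recent_ratios)
    new_scale = (1.0 - lam) * scale + lam * mean_ratio
    new_sigma = scale * sigma  # uses the current scale s_t
    return new_scale, new_sigma


if __name__ == "__main__":
    ratios = [scale_ratio(0.9, 0.8), scale_ratio(0.8, 0.85)]
    print(update_scale(1.0, 2.0, ratios, lam=0.25))
    # Without the guard this ratio would raise ZeroDivisionError.
    print(scale_ratio(0.5, 0.0))
```

Clamping the ratio to a plausible range would be another way to stabilize the estimate; which remedy is preferable is left open here.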


C. Part-based Tracking 

Instead of learning a holistic appearance model, various
part-based tracking algorithms have been proposed, in which the
target is tracked by its local appearance. If the target is
partially occluded, its remaining visible parts can still
represent the target, and thus the tracker is able to continue
tracking. Experimental results in earlier studies have shown
that local representations are effective for object tracking.
Introducing a part-based tracking strategy into CFTs is therefore
expected to be advantageous.

Recently, [24] and [25] have made successful attempts to apply
the part-based tracking strategy to CFTs. In [24], five parts of
the target are independently tracked by KCF trackers. When a new
frame comes, confidence maps of these tracked parts are first
computed. By assigning adaptive weights to these maps, a joint
map is constructed to predict the new state using a particle
filter method. Another tracker, called Reliable Patch Trackers
(RPT) [25], also exploits local contexts and uses KCF as its base
tracker. However, the tracked parts in RPT are automatically
selected by sampling, and their reliabilities are estimated on
the fly. A reliable patch is defined as one that is trackable and
sticks to the target. If a part is no longer reliable, it is
discarded and re-sampled around the target. After obtaining the
tracking results of reliable patches, the new state of the target
is predicted by a Hough Voting-like scheme.

In general, part-based tracking strategy can be helpful to 
gain robustness against partial occlusions. The main difficulty 
of developing a robust part-based CFT is how to design 
an appropriate mechanism to handle multiple results from 
different tracked parts. According to the introduced studies, 
particle filter method has proven to be an effective solution. 

Detailed derivations of (35), (36) and (37) can be found at
http://www4.comp.polyu.edu.hk/~cslzhang/STC/STC.htm











D. Long-term Tracking 

One other vital challenge in visual tracking is the absence 
of the target. If the target partially or fully disappears from the 
view, conventional CFTs can be easily distracted by irrelevant 
objects because they do not contain a long-term component. 
As a consequence, introducing long-term tracking methods is 
believed to be favorable for improving correlation filter-based 
tracking methods. 

For long-term tracking, there exist several studies, some of
which introduce a re-detection module while the others attempt
to learn a conservative appearance of the target. For example,
the TLD tracker trains a detector with an expert of false
negative samples and an expert of false positive samples. If the
tracking module in TLD fails, this trained detector can then
re-initialize the tracker. On the other hand, the tracker of [59]
conservatively learns the target appearance from reliable frames
with a self-paced learning scheme. Regarding CFTs, two recent
studies [22], [23] have successfully combined CFTs with long-term
tracking.

Inspired by a biological memory model called the
Atkinson-Shiffrin Memory Model (ASMM), a MUlti-Store Tracker
(MUSTer) based on a cooperative tracking framework was proposed
in [22]. In ASMM, the human brain has a short-term memory and a
long-term memory. Short-term memory, which updates aggressively
and forgets information quickly, stores local and temporal
information, while long-term memory, which updates
conservatively and maintains information for a long time, retains
general and reliable information. With short-term and long-term
memory working together, both efficiency and robustness can be
achieved. If CFTs are regarded as efficient short-term trackers,
introducing long-term tracking methods is expected to compensate
for their shortcomings. In MUSTer, the long-term part is a
keypoint-based method. In the course of tracking, key points of
the target are maintained or discarded based on a forgetting
curve, and then retrieved for locating the target if the CFT
fails. Experiments have shown that MUSTer has surpassed various
state-of-the-art trackers on different benchmarks.

Another method of introducing long-term tracking was proposed
by Ma et al. [23]. In this method, a re-detection component is
added to the tracking system. Similar to the TLD tracker, the
re-detection procedure is carried out by an online random fern
classifier, whose training samples are collected by a k-nearest
neighbour (kNN) classifier. With this cooperation, the resulting
tracker, namely LCT, has been shown to handle long-term tracking
well.

V. Experiments 

In this section, both quantitative and qualitative experiments
are conducted on large-scale benchmarks to evaluate the
advantages brought by the correlation filter-based tracking
framework. To carry out comprehensive and fair comparisons, an
additional 29 popular trackers are evaluated as well, and the
parameters of all evaluated trackers are kept at their defaults
and fixed during the experiments. The hardware used in the
evaluation is a cluster node (3.4 GHz Intel Xeon CPU, 8 cores,
32 GB RAM).


A. Experiment Setup 

1) Compared Trackers: For CFTs, trackers with currently
available source code are selected for our evaluation (except
ASEF and MOSSE, which we implemented ourselves). These are
regularized ASEF, MOSSE, CSK (with raw pixels), KCF (with HOG
features), CN, DSST, SAMF, STC, MUSTer [22] and RPT [25]. Other
competing trackers used for comparison include 28 trackers from
the code library of the Online Object Tracking Benchmark (OOTB)
and a recent state-of-the-art tracker, MEEM [63].

2) Test Sequences: All the test sequences in our evaluation
come from OOTB. The original OOTB contains 51 different tracking
tasks with fully annotated attributes, which include scale
variations, illumination variations, rotations and so on. In the
later OOTB, the number of tasks has been extended to 100. In our
experiments, the original OOTB is mainly used to compare
trackers since it is more representative. The later OOTB is used
as an extended dataset to verify the performance of CFTs.

3) Evaluation Methods: Following the OOTB protocol, One-Pass
Evaluation (OPE), Temporal Robustness Evaluation (TRE) and
Spatial Robustness Evaluation (SRE) are performed in our
evaluation. OPE is the traditional evaluation method, which runs
each tracker once on each sequence. TRE runs trackers on 20
sub-sequences of different lengths segmented from the original
sequence, and SRE evaluates trackers by initializing them with
slightly shifted or scaled ground-truth bounding boxes. With TRE
and SRE, the robustness of each evaluated tracker can be
comprehensively interpreted.

After running the trackers, precision plots and success plots
are used to present the results. Instead of using the average
Euclidean distance between predicted centers and ground-truth
centers, precision plots show the percentage of frames whose
estimated locations lie within a given threshold distance of the
ground-truth centers. For success plots, an overlap score is
introduced to represent performance. Let $r_t$ denote the tracked
bounding box and $r_a$ denote the ground truth. An Overlap Score
(OS) can be defined by $S = \frac{|r_t \cap r_a|}{|r_t \cup r_a|}$,
where $\cap$ and $\cup$ are the intersection and union of two
regions, and $|\cdot|$ counts the number of pixels in the
corresponding area. A frame whose OS is larger than a threshold
is termed a successful frame, and the ratios of successful

Original implementation can be found at
http://www.cs.colostate.edu/~vision/ocof_toolset_2012/index.php

http://home.isr.uc.pt/~henriques/circulant/

http://liu.diva-portal.org/smash/record.jsf?pid=diva2%3A711538&dswid=-7492

https://github.com/gnebehay/DSST

https://github.com/ihpdep/samf

http://www4.comp.polyu.edu.hk/~cslzhang/STC/STC.htm (Note that STC
may crash during scale estimation, so we simply fix the parameters
before the detected errors.)

https://sites.google.com/site/zhibinhong4131/Projects/muster

https://github.com/ihpdep/rpt

http://cvlab.hanyang.ac.kr/tracker_benchmark/

http://cs-people.bu.edu/jmzhang/MEEM/MEEM.html
Fig. 3. Precision plots and success plots of OPE, SRE and TRE in OOTB (precision plots are drawn against the location error threshold). Trackers with the best 10 scores are presented in the legends.

Fig. 4. Attribute-based success plots of SRE. The trackers with the best 10 AUC scores are presented in the legends.


frames at thresholds ranging from 0 to 1 are plotted in the
success plots.

Furthermore, trackers are ranked in both plots, and the first
10 are presented. In precision plots, the precision at the
threshold of 20 pixels is used for ranking, while the Area Under
Curve (AUC) is used for ranking in success plots. Since AUC
measures overall performance, it is more representative for
estimating the robustness of trackers.
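
The two ranking measures can be written down compactly. The sketch below is an illustration only, not the benchmark toolkit: boxes are assumed to be `(x, y, width, height)` tuples, the AUC is approximated with the trapezoidal rule over 21 evenly spaced thresholds, and all function names are hypothetical.

```python
def overlap_score(box_a, box_b):
    """Intersection over union of two (x, y, width, height) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def precision_at(center_errors, threshold=20.0):
    """Fraction of frames whose center error is within the threshold."""
    hits = sum(1 for e in center_errors if e <= threshold)
    return hits / len(center_errors)


def success_auc(overlaps, n_thresholds=21):
    """Area under the success curve over overlap thresholds in [0, 1]."""
    step = 1.0 / (n_thresholds - 1)
    thresholds = [i * step for i in range(n_thresholds)]
    rates = [sum(1 for o in overlaps if o > t) / len(overlaps)
             for t in thresholds]
    # Trapezoidal rule over consecutive threshold pairs.
    return sum((rates[i] + rates[i + 1]) / 2.0 * step
               for i in range(n_thresholds - 1))


if __name__ == "__main__":
    truth = [(10, 10, 40, 40), (12, 11, 40, 40)]
    tracked = [(12, 10, 40, 40), (30, 30, 40, 40)]
    overlaps = [overlap_score(t, g) for t, g in zip(tracked, truth)]
    print(overlaps, success_auc(overlaps), precision_at([2.0, 26.2]))
```

Because the AUC integrates the success rate over all thresholds, it does not depend on a single cut-off, which is why it is used for ranking in the success plots.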

B. Quantitative Evaluation 

1) Overall Performance: Using the 51 sequences from OOTB,
the overall performance of all 39 trackers is obtained and
Fig. 5. Extended results for 10 CFTs and 4 indicator trackers. Plots of OPE, SRE and TRE in OOTB100 are presented, as well as the scores and ranks of all
the 14 trackers.


TABLE II

Additional results of OPE. Average center location errors (CLE), average overlap scores (OS) and average speeds (FPS) are presented. For each tracker and metric, the first value comes from OOTB with 51 tasks and the second from OOTB100.

Tracker   CLE (OOTB)    CLE (OOTB100)   OS % (OOTB)   OS % (OOTB100)   FPS (OOTB)   FPS (OOTB100)
MUSTer    17.3          31.7            65.0          58.3             3.85         3.94
RPT       36.7          36.3            58.2          53.9             3.70         3.63
SAMF      28.9          35.9            57.9          54.8             15.8         15.1
MEEM      22.3          28.8            57.9          53.6             19.3         20.0
DSST      41.3          50.8            56.1          52.2             32.7         31.5
KCF       35.5          45.3            51.9          48.0             191          183
ASLA      73.1          75.5            43.9          40.6             7.48         4.75
Struck    50.6          49.6            47.8          45.9             20.4         20.7
TLD       48.1          54.9            44.1          42.4             33.3         40.9
CSK       (illegible)   306             40.1          38.4             288          282
MOSSE     82.8          99.6            31.8          29.1             281          284
STC       68.3          80.4            35.1          31.2             580          578
CN        64.8          81.8            44.8          42.4             142          132
ASEF      91.6          130             29.9          27.4             324          320


presented in Figure 3.

According to the presented results, CFTs such as MUSTer, RPT
and SAMF perform considerably well in these plots. In
particular, the recently proposed tracker MUSTer outperforms the
other trackers in all the success plots (64.1% in OPE, 56.4% in
SRE and 61.7% in TRE). In precision plots, MUSTer also achieves
state-of-the-art results, although MEEM scores 1% higher in TRE.

Moreover, six improved CFTs consistently rank in the top 10,
which indicates that these CFTs are considerably robust in
tracking.

2) Attribute-based Evaluation: With the annotated attributes
of each sequence, the performance of the evaluated trackers with
respect to different challenges is revealed. The challenges
involved are mainly caused by three factors: varying appearance
of the target, severe surrounding environments and the
limitations of the cameras. The SRE results of success plots are
presented in Figure 4.

The challenges brought by varying appearance of the target
are scale variation, out-of-plane rotation, in-plane rotation,
deformation and fast motion. In scale variations, MUSTer, DSST
and SAMF perform the best with around 50% overlap score, which
suggests that the searching strategy introduced in Section IV-B
is effective. In the meantime, the score of RPT (48.8%) shows
that its scale estimation scheme is also applicable. For
rotations and deformations, MUSTer is shown to be the most robust
tracker, and RPT, SAMF and MEEM also perform well. In the fast
motion evaluation, however, the best score is achieved by MEEM
(52.7%), which is 2.4% higher than the score of MUSTer. In
conclusion, the improved CFTs, namely MUSTer, RPT, DSST and
SAMF, can learn a considerably robust appearance model, while
MEEM is shown to be better at locating fast-moving targets.

Another group of challenges is caused by surrounding
environments, including illumination variations, occlusions and
background clutters. CFTs still produce better results here and
exclude other trackers from the first two ranks, which implies
that background contexts can be efficiently identified to help
avoid distractions using correlation filter-based tracking
algorithms.

The rest of the evaluated attributes are motion blur, out of
view and low resolution, which are mainly brought by the
limitations of cameras. In the motion blur and out of view
challenges, the multi-expert restoration scheme has made MEEM
the most robust among the tested trackers, while MUSTer
maintains the first rank in low resolution. It is worth
mentioning that RPT drops considerably in low-resolution
tracking, where it only achieves a 34.3% success rate and the
6th rank in the top 10. This may suggest that tracking with local
appearance needs high resolution for better performance.

To sum up, recent CFTs such as MUSTer, RPT, SAMF, DSST and
KCF all perform well in various challenging tasks, and the
strengths of using correlation filters for tracking are shown to
be significant. In particular, MUSTer is the most robust tracker
in our evaluation, since it wins 8 of the 11 challenges.

3) Extended Experiments: For a better interpretation of the
tested CFTs, a larger OOTB with 100 sequences, denoted as
OOTB100, is used for extended experiments. The overall
performance can be found in Figure 5. To better represent the
performance of CFTs, 4 other competing trackers are additionally
selected as indicators. These selected trackers are Struck and
ASLA, which use a single online classifier for tracking, and
MEEM and TLD, which apply multiple online classifiers.

According to the presented results, several successful
improvements on CFTs can be identified. First, the training
scheme of MOSSE, which averages over all the samples, is shown
to be better than the training scheme of ASEF, which averages
over learned filters. Second, the spatial contexts used in STC
are also shown to be beneficial. Furthermore, introducing kernel
methods has made CSK and KCF much more competitive, and has
helped CFTs like MUSTer and SAMF become state-of-the-art
trackers. On the other hand, the color attributes used in the CN
tracker and the HOG feature used in KCF are also shown to be
advantageous. By integrating both features, SAMF further
improves the overall performance. Moreover, improvements based
on relieving the scaling issue, applying a part-based strategy
and introducing long-term tracking are shown to be effective as
well.

In comparison with the other four competing trackers, improved
CFTs like MUSTer and SAMF still perform well in OOTB100,
although the best results in the precision plots of OPE and TRE
are achieved by MEEM. This may be because its restoration method
helps MEEM quickly re-locate the target after drifting.

Besides the plots, additional statistical data of trackers
under OPE can be found in Table II, where the average Center
Location Error (CLE), average Overlap Score (OS) and average
speed of the tested trackers are presented. According to the
results, MUSTer and MEEM deliver the most precise results, and
higher overlap scores are achieved by MUSTer, RPT and SAMF. The
speeds of the presented trackers further show that better scores
are often achieved by slower trackers, which suggests that
acceleration is still an open problem.

C. Qualitative Evaluation 

To evaluate the actual tracking results, we have randomly
selected 20 video sequences from OOTB100 and used them for
qualitative evaluation. These videos include Couple, CarScale,
Bolt, Car4, Liquor, Board, Walking2, Singer1, Soccer, Car1,
Girl2, Sylvester, MotorRolling, Jogging-2, Rubik, Freeman1,
FaceOcc1, Deer and Basketball. In these videos, all the
challenge attributes are properly included. The top-6 trackers in
OOTB100, namely MUSTer, RPT, SAMF, MEEM, DSST and KCF, are
selected for comparison; their predicted boxes at the
seven-eighths point of each tested sequence are presented in
Figure 6.

For the KCF tracker, the limitation of using fixed windows is
revealed in CarScale and Singer1, where the predicted boxes
clearly do not match the target. By relieving the scaling issue,
DSST and SAMF track better in sequences like Car4 and CarScale.
However, the lack of a long-term component makes them unable to
re-locate missed targets, as shown by the failures in Freeman1.
With long-term consideration, the results of MUSTer are much
better. For example, while the other tested trackers lose the
target in MotorRolling, MUSTer is able to find the target and
restart tracking. The only failed sequence for MUSTer is Board,
in which background objects are quite similar to the target. In
addition to MUSTer, MEEM also has the ability to re-locate the
target, but its performance is limited when dealing with
illumination variations, as shown by its failure in Singer1. For
RPT, as discussed in the attribute-based evaluation, the
tracking results become worse if the video quality is rather low,
and RPT fails in Bolt from the very beginning, which suggests
that the initialization may be crucial for RPT.

Overall, although the qualitative performance of the evaluated
trackers is promising, there is still plenty of room to improve
the robustness of CFTs as well as other trackers. The
implementation of MUSTer may provide a promising direction for
further research.

VI. Conclusion 

In this paper, we have reviewed numerous correlation
filter-based tracking algorithms and conducted comprehensive
experiments. According to the experimental results, the
efficiency and robustness of tracking with correlation filters
have been verified. Specifically, state-of-the-art performance
can be achieved by improved CFTs like MUSTer, RPT, SAMF and
DSST. Furthermore, MUSTer has been shown to be the most robust
tracker in the experiments.

To further improve CFTs, some valuable points can be drawn
from the obtained results. First, the possibility of drifting is
significantly reduced with powerful features, as shown by the
improved performance of CN and KCF. It is worth trying more
types of features. Next, solving the scaling issue is also a
vital direction for improving CFTs. Although experimental
results have shown that both scaling pool and part-based methods
are helpful for handling scale variations, approaches with less
processing time are favorable. Regarding the part-based tracking
strategy, its advantages have not been fully exploited in RPT
according to its performance. A more effective mechanism to fuse
the tracking results of different parts is desired. Lastly,
introducing efficient long-term tracking methods is another
promising direction. With the outstanding performance brought by
MUSTer, long-term tracking is shown to be the most helpful
complement of correlation filter-based tracking. Further studies
can be made on improving the accuracy of the long-term tracking
component,
Fig. 6. Qualitative results of 6 selected trackers in 20 sequences. The name of each sequence is located on top of the corresponding figure. Frames are
collected from seven-eighths of the way through the sequences. Panel titles recovered here: Car4, Basketball, Board, Bolt, Car1, CarScale, Couple, David, Deer, FaceOcc1, Freeman1, Girl2, Jogging-2, Liquor, MotorRolling.


developing more appropriate architectures and accelerating the 
overall system. 